ENH: Add use_nullable_dtypes in csv internals #48403

phofl · 2022-09-05T18:57:54Z

xref ENH: add option to get nullable dtypes to pd.read_csv #36712
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This adds the required casting logic for extension array dtypes. Want to get this done as a first step before adding a public option to read_csv. The unit tests can be used as a guideline for the required behaviour.

This always casts to EA dtypes, even when no missing values are present. From a user's perspective this makes the most sense imo: when you opt in to nullable dtypes, you don't want a follow-up operation (reindexing, enlargement in indexing ops, ...) to leave you with float in a pipeline. That said, if there are concerns about this, we could also do something like

use_nullable_dtypes=False | always | when necessary
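
For illustration, a minimal sketch of the pipeline concern described above, using only public pandas API. The series contents and the new index are made up for the example and are not taken from the PR's tests. Reindexing a NumPy-backed integer column onto a label it does not contain falls back to float64, while the nullable Int64 column keeps its dtype and marks the gap with pd.NA.

```python
import pandas as pd

# Same data, once NumPy-backed (int64) and once as a nullable extension array.
plain = pd.Series([1, 2, 3])
nullable = pd.Series([1, 2, 3], dtype="Int64")

# Reindexing onto a label that does not exist introduces a missing value.
new_index = [0, 1, 2, 3]
plain_reindexed = plain.reindex(new_index)
nullable_reindexed = nullable.reindex(new_index)

# The NumPy-backed column is upcast to float to hold NaN;
# the nullable column stays integer and stores pd.NA instead.
print(plain_reindexed.dtype)
print(nullable_reindexed.dtype)
```

This is the behaviour that an "always cast" default would protect against: a column that happens to have no missing values at read time still starts out with a dtype that can absorb them later.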

cc @jorisvandenbossche

jbrockmendel · 2022-09-06T18:18:14Z

pandas/_libs/parsers.pyx

@@ -15,6 +15,13 @@ import warnings

 from pandas.util._exceptions import find_stack_level

+from pandas import StringDtype
+from pandas.core.arrays import (


is there a viable way to do this outside of the cython code?

A possible option might be to change _maybe_upcast to optionally return a values array + mask array, and then do the actual construction in python?

Although since _maybe_upcast is only called in this file (and thus from cython), that won't help, unless we propagate such a (values, mask) tuple (instead of ArrayLike) through the different calls and return that from the TextReader.read method.
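
A rough sketch of the (values, mask) idea from this comment, written in plain Python rather than Cython. The helper names `split_values_and_mask` and `build_masked_array` and the `-1` sentinel are hypothetical and do not come from the PR; only `pd.arrays.IntegerArray(values, mask)` is existing public pandas API. The point is only to show where the split between "compute the mask" and "construct the extension array" could sit.

```python
import numpy as np
import pandas as pd


def split_values_and_mask(arr: np.ndarray, na_value: int):
    """Hypothetical low-level step: flag sentinel entries as missing.

    Returns the untouched values plus a boolean mask, without
    importing or constructing any extension array.
    """
    mask = arr == na_value
    return arr, mask


def build_masked_array(values: np.ndarray, mask: np.ndarray):
    """Hypothetical Python-level step: wrap (values, mask) in a nullable array."""
    return pd.arrays.IntegerArray(values, mask)


# Assumed sentinel for this example only; -1 stands in for "missing".
raw = np.array([1, 2, -1, 4], dtype="int64")
values, mask = split_values_and_mask(raw, na_value=-1)
result = build_masked_array(values, mask)
print(result)
```

As noted in the thread, this only helps if the tuple is carried all the way out of the parser, which is the extra complexity the discussion decided against.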

yah not worth contorting ourselves to avoid non-cython imports. just if there's a convenient alternative

Could be done yes, but would make logic more complex, which is something we should avoid here I think

yah, never mind then. thanks for taking a look

mroeschke

LGTM

mroeschke · 2022-09-19T23:14:01Z

Thanks @phofl

* ENH: Add use_nullable_dtypes in csv internals
* Add tests
* Fix mypy
* Add comment
* Add pyarrow test
* Fix float32
* Fix float32
* Add contextmanager

phofl added 2 commits September 5, 2022 20:39

ENH: Add use_nullable_dtypes in csv internals

0f370f0

Add tests

f5e2015

phofl added labels IO CSV (read_csv, to_csv) and NA - MaskedArrays (Related to pd.NA and nullable extension arrays) Sep 5, 2022

Fix mypy

373a17b

jbrockmendel reviewed Sep 6, 2022

Merge remote-tracking branch 'upstream/main' into enh_add_use_nullabl…

197c796

…e_dtypes_in_maybe_upcast

mroeschke reviewed Sep 12, 2022

pandas/tests/io/parser/test_upcast.py (outdated, resolved)

mroeschke reviewed Sep 12, 2022

pandas/tests/io/parser/test_upcast.py (outdated, resolved)

phofl added 5 commits September 12, 2022 20:27

Add comment

baa5310

Merge remote-tracking branch 'upstream/main' into enh_add_use_nullabl…

7a106b4

…e_dtypes_in_maybe_upcast

Add pyarrow test

709e068

Fix float32

4100ad0

Fix float32

757f113

mroeschke reviewed Sep 13, 2022

pandas/tests/io/parser/test_upcast.py (outdated, resolved)

Add contextmanager

9e85b39

mroeschke approved these changes Sep 13, 2022

mroeschke added this to the 1.6 milestone Sep 13, 2022

phofl added 2 commits September 13, 2022 18:59

Merge remote-tracking branch 'upstream/main' into enh_add_use_nullabl…

77b3715

…e_dtypes_in_maybe_upcast

Merge remote-tracking branch 'upstream/main' into enh_add_use_nullabl…

326378f

…e_dtypes_in_maybe_upcast

mroeschke merged commit 1273bc9 into pandas-dev:main Sep 19, 2022

phofl deleted the enh_add_use_nullable_dtypes_in_maybe_upcast branch September 20, 2022 08:23

mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022

jbrockmendel Sep 6, 2022

jorisvandenbossche Sep 6, 2022

jbrockmendel Sep 6, 2022

phofl Sep 6, 2022

jbrockmendel Sep 6, 2022

mroeschke left a comment

mroeschke commented Sep 19, 2022
